Sveinbjornsson and Halldorsson BMC Bioinformotics 2012, 13(Suppl 6):S7 
http://www.biomedcentral.eom/1 471 -21 05/1 3/S6/S7 



Bioinformatics 



PROCEEDINGS Open Access 



PAIR: polymorphic Alu insertion recognition 

Jon Ingi Sveinbjornsson* Bjarni V Halldorsson 

From Second Annual RECOMB Satellite Workshop on Massively Parallel Sequencing 
Barcelona, Spain. 19-20 April 2012 



Abstract 

Background: Alu polymorphisms are some of the most common polymorphisms in the genome, yet few methods 
have been developed for their detection. 

Methods: We present algorithms to discover Alu polymorphisms using paired-end high throughput sequencing 
data from multiple individuals. We consider the problem of identifying sites containing polymorphic Alu insertions. 

Results: We give efficient and practical algorithms that detect polymorphic Alus, both those that are inserted with 
respect to the reference genome and those that are deleted. The algorithms have a linear time complexity and 
can be run on a standard desktop machine in a very short amount of time on top of the output of tools standard 
for sequencing analysis. 

Conclusions: In our simulated dataset we are able to locate 98.1% of Alus inserted with respect to the reference 
and 97.7% of Alus deleted, our simulations also show an excellent correlations between the deletions detected in 
parents and children. We further run our algorithms on publicly available data from the 1000 genomes project and 
find several thousand Alu polymorphisms in each individual. 



Introduction 

We consider the problem of detecting polymorphic Alu 
insertions from DNA sequence reads using high 
throughput paired-end sequencing data. 

Genomewide association studies (GWAS) proceed by 
identifying a number of individuals carrying a disease or 
a trait and comparing these individuals to those that do 
not or are not known to carry the disease/trait. Both 
sets of individuals are then genotyped for a large num- 
ber of Single Nucleotide Polymorphism (SNP) genetic 
variants which are then tested for association to the dis- 
ease/trait. GWAS have been able to successfully identify 
a very large number of polymorphism associated to dis- 
ease (e.g. [1-3]). Studies using tens of thousands of indi- 
viduals are becoming commonplace and are increasingly 
the norm in the association of genetic variants to disease 
[1-3]. 

Whole genome resequencing using next generation 
sequencers is rapidly becoming the sledgehammer of 
genomewide association studies. Increasingly, GWAS 
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are done in conjunction with the sequencing of number 
of individuals [4,5] or alternatively using variants identi- 
fied from the resequencing of a number of individuals 
[6]. Whole genome resequencing is preferable over SNP 
genotyping for association studies as it allows for the 
detection of all genomic variation and not only SNP var- 
iation. SNPs are the most abundant form of variation 
between two individuals. However, other forms of varia- 
tion exist, such as inversions, copy-number variations, 
LINE (Long INterspersed Elements) and SINE (Short 
INterspersed Elements) elements, including Alu 
insertions. 

Copy number variations, have been shown to be influ- 
ential factors in many diseases [7], and a number of 
methods have been proposed for the detection of struc- 
tural variants (e.g. [8-12]). Despite the fact that our 
computations indicate that the number of polymorphic 
Alu repeats carried by an individual are on a compar- 
able scale to the number of copy number variations car- 
ried by an individual, apart from [13], no reliable 
methods have been specifically developed for detecting 
Alu repeats in multiple individuals. Polymorphic Alus 
are also known to be good markers for constructing 



o 



© 2012 Sveinbjornsson and Halldorsson; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the 
BiolVted C6ntTcll Creative Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly cited. 



Sveinbjornsson and Halldorsson BMC Bioinformatics 2012, 13(Suppl 6):S7 
http://www.biomedcentral.eom/1 471 -21 05/1 3/S6/S7 



Page 2 of 9 



phylogenetics of homonid evolution [14] and determing 
human diversity [15]. 

An Alu sequence is an approximately 300 basepair 
long sequence derived from 7SL RNA gene [16]. Alu 
repeats are SINE that occur frequently in the human 
genome, as well as in other genomes. The Alu sequence 
family has been propagated to more then one million 
copies in primate genomes over the last 65 million 
years. Alu repeats are the largest family of mobile ele- 
ments in the human genome and the Alu family com- 
prises more then 10% of the human genome. Most Alu 
repeats were inserted early in primate evolution, where 
it is estimated that there was approximately one new 
Alu insertion in every primate birth [17]. 

Almost all of the recently integrated human Alu ele- 
ments belong to one of several small and closely related 
young Alu subfamilies, while other elements have been 
found to be largely orthologous to other primates. 
These largely human-specific AluY subfamilies represent 
approximately 0.5% of all the Alu repeats in the human 
genome. Our computations verify that AluY is the most 
polymorphic Alu family in our dataset. 

The current rate of Alu insertion is estimated to be of 
the order of one Alu insertion in every 200 births [18]. 
Some members of these young Alu subfamilies have 
been inserted into the human genome so recently that 
they are polymorphic with respect to the presence or 
absence of insertion in different human genomes. Those 
relatively few elements that are present in the genomes 
of some individuals and absent from others are referred 
to as Alu-insertion polymorphisms. The primary goal of 
this paper is the discovery of these Alu insertion 
polymorphisms. 

We give an algorithm targeted to finding Alu poly- 
morphism from next generation paired-end sequencing 
data. In what follows we will start by giving our problem 
framework, followed by a description of our algorithms 
and finally we show some experimental results. 

Methods 

Problem framework 

The input to our problem is a reference genome and a 
set of paired-end sequence reads from a set of indivi- 
duals. The genome sequence of the reference indivi- 
dual is known and will be highly similar, but not 
identical, to the genome of the individual(s) being 
sequenced. Paired-end sequencing reads consist of a 
read of a fixed length, followed by a short spacing, fol- 
lowed by another read. The spacing between the two 
reads follows a probability distribution, Y. Y can be 
assumed to be known a priori or to be easily estimated 
from the sequence reads [19] (cf. Additional file 1 for 
the estimation of Y). The two reads are substrings of 
DNA sequence, with one read being read from the + 



strand and the other being read from the - strand. The 
fact that the two reads are read in opposite direction 
ensures that; If the location of one of the reads is 
known then the location of the mate (the other read) 
is also known, up to Y. The genome sequence of the 
individual(s) being sequenced is however not known a 
priori, but is highly similar to the reference genome. 
At some locations in the reference genome the gen- 
omes of the reference and the individual(s) being 
sequenced will diverge. Some of this divergence is due 
to the insertion of Alu polymorphisms. A mechanism 
exists for Alu sequences to insert themselves into a 
genome while no such direct mechanism is known to 
exist for Alu sequences to remove themselves from the 
genome. Once inserted, the sequence will exist in the 
sequence context where it was inserted. 

When the polymorphic Alu is not contained in the 
reference, we consider the Alu to be inserted with 
respect to the reference. When the polymorphic Alu 
sequence is contained in the reference genome and 
some of the sequenced individuals we consider the Alu 
sequence to be deleted with respect to the reference, 
even though evolutionary the sequence most likely has 
been inserted. 

The output of our algorithm is a set of locations in 
the genome where an Alu sequence is inserted in some 
individual(s) as well as the sequence reads of the indivi- 
duals being studied for these insertions. As each indivi- 
dual contains two haplotypes a polymorphic Alu may be 
inserted on one, both or neither of these haplotypes. 

We formulate four versions of the problem of identify- 
ing Alus, when the Alu sequences are inserted or 
deleted with respect to the reference genome, both for 
identifying these polymorphism on a single individual 
and on multiple individuals. 
Problem 1 

Single Individual Deleted Alu identification problem 
Input A set of paired-end sequence reads from a single 
individual and a reference genome. 

Output A list of locations in the genome where an 
Alu is deleted with respect to the reference genome. 
Problem 2 

Multiple Individual Deleted Alu identification pro- 
blem Input A set of paired-end sequence reads from 
multiple individuals and a reference genome. 

Output A list of locations in the genome where there 
exists an individual with an Alu deleted with respect to 
the reference genome. 
Problem 3 

Single Individual Inserted Alu identification problem 
Input A set of paired-end sequence reads from a single 
individual and a reference genome. 

Output A list of locations in the genome where an 
Alu is inserted with respect to the reference genome. 
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Problem 4 

Multiple Individual Inserted Alu identification pro- 
blem Input A set of paired-end sequence reads from 
multiple individuals and a reference genome. 

Output A list of locations in the genome where there 
exists an individual with an Alu inserted with respect to 
the reference genome. 

Following the identification of polymorphic regions we 
need to determine which individuals are polymorphic 
for each polymorphism. 
Problem 5 

Alu genotyping problem Input A single location in the 
reference genome known to contain a polymorphic Alu. 
A set of individuals and a set of sequence reads for each 
individual. 

Output For each individual, a genotype call, assigning 
the individual 0, 1 or 2 copies of the given Alu, repre- 
senting an Alu on neither, one or both haplotypes. 

We start by giving the common algorithmic frame- 
work for our algorithms and then proceed to giving 
algorithms for each of the problems in turn. We start by 
describing our approach for the detection of deleted 
regions in a single individual. We then extend this to 
recognizing deletions in multiple individuals simulta- 
neously. We then show how these ideas can be extended 
to identifying inserted Alus, first in a single individual 
and finally in multiple individuals simultaneously. 

Algorithm framework 

Our algorithms start by mapping the sequence reads to 
the reference genome and analyzing the output of such 
a mapping. 
Alu Mate 

We start by preprocesing the sequence reads to make 
them easier for manipulation. The initial step of our 
algorithm is to map the sequencing reads to the human 
reference genome build 37 (hgl9) using the Burrows 
Wheeler Aligner (BWA) [20]. The program outputs a 
mapping of all sequence reads to the genome and also 
outputs whether there are alternate locations in the gen- 
ome with sequence alignment. An underlying assump- 
tion is that most of the reads are long and accurate 
enough that they will only map to a single location on 
the genome. Technology where each paired end is 100 
bases or greater with accuracy over 98% is readily avail- 
able and in use [4,5]. In random DNA the probability of 
such reads mapping to multiple places on the genome is 
extremely low. Reads mapping to Alu sequences how- 
ever will almost always have multiple places on the gen- 
ome that have similar quality mapping. Unless its mate 
is mapped to a proximal location, we will not use the 
mapping of such reads as input to our algorithm, but 
rather label such reads as Alu reads. We further align 
each read to the set of known Alu families and label 



those that align well to the database as Alu reads. Most 
paired-end mates of Alu reads will map uniquely to the 
genome. We note that from the mapping of the paired- 
end it is easy to determine whether the Alu sequence 
should be to the left or the right of the mapped 
sequence. 

A read pair is defined as improper if the two ends of 
the pair map to locations that are inconsistent with the 
read pair distance Y. We store all such improper pairs 
where one end is an Alu read and refer to the mate of 
those reads as Alu mates. Each of these read pairs either 
gives evidence of an Alu insertion or the read is impro- 
perly mapped or read. We label the Alu mate with an r 
if the mapped read is to the right of the Alu sequence 
and label them with a / if the mapped read is to the left 
of the Alu mapped read. The first step of our algorithm 
is to search for all Alu mates. At the same time we 
store the position and chromosome of the Alu mate, 
whether it is an / or an r read, to which Alu the Alu 
read mapped, to which Alu family that Alu belongs, 
where within the Alu the Alu read mapped and how 
many best matches to the reference genomes for the 
read where found by BWA. We term this algorithm Alu 
mate and we observe that it runs in time that is on the 
order of the number of reads. 

Lemma 1 

Algorithm Alu Mate runs in 0(n r ) time, where n r is 
the number of reads. 
Analysis of mapped reads 

The output of Alu mate is a mapping of sequence reads 
to the reference genome and an assignment of / and r 
read labels. 

Figure 1 shows the output of Alu mate and how it can 
be used to identify regions where an Alu is deleted with 
respect to the reference individual. Black arrows show 
the location and direction of the reads and the red lines 
show the insert between the reads. The location of the 
Alu is shown at the bottom of each figure. The leftmost 
figure shows an individual carrying the Alu on both of 
his chromosomes, notice that the distance between 
reads always follows the same distribution. The right- 
most figure shows an individual that does not carry the 
Alu on either one of his chromosome (homozygote 
non-Alu), notice that the distance between the reads is 
longer for those reads overlapping the Alu and that no 
reads are mapped inside of the Alu. The center figure 
shows an individual heterozygote for the Alu. 

Figure 2 shows the mapping the output of Alu mate 
and how it can be used to identify an Alu polymorph- 
ism. Black arrows show the reads and their direction. 
Red lines show the insert between the reads. Green 
arrows show / reads and blue arrows show r reads. Left- 
most figure shows an individual carrying no copy of the 
Alu (homozygote non-Alu), notice the absence of / and 
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Figure 1 Example of an Alu deletion. Example of an Alu deletion. Arrows show read directions. Black arrows show normal mapping reads, red 
lines show the insert between them. The leftmost figure shows a normal individual, center an heterozygote and rightmost an individual 
homozygote for an Alu deletion. The location of the Alu is shown with a thick red line in the bottom of each figure. 



r reads. The rightmost figure shows an individual homo- 
zygote for the Alu insertion, notice that the / reads 
occur to the left of the insertion and r reads to the right 
of the insertion and that no reads overlap the insertion. 
The center figure shows an individual carrying a single 
copy of the Alu (heterozyte). 

Detection of deleted Alus 

We consider an Alu sequence deleted when it occurs in 
the reference assembly, but not in the individual(s) 
being sequenced. There are two primary signs of dele- 
tion, some of the reads will be split, containing one part 
from each side of the deletion. The second signal is that 
there are reads that have one end mapping to each of 
the two sides of the Alu being considered and a corre- 
sponding increase in their insert length. The distance 
between these reads, as measured with respect to the 
reference genome will be in expectation be longer than 
Y and should be distributed as Y + l Aiw where l A i w is 
the length of the deleted Alu. Detecting deleted Alus is 
considerably simpler than detecting inserted Alus, as the 
location of the Alu is known. For detecting Alu dele- 
tions we hence only need to consider locations that 
have been already annotated to contain Alus. 
Genotyping deleted Alus 

For each Alu annotated in the reference genome we 
determine the genotypes of the polymorphism of the 
individual. We let Y e be the e percentile of Y and Y X - e be 
the 1 - e percentile of Y, where e is a small constant 
(0.005). At each annotated Alu we consider a window of 
size Yi_ e to the left and right of the estimated Alu. 

We construct a set T consisting of all reads where 
both ends are in a window containing the Alu and Yi_ e 
to the left and right of the Alu. Here / and r are defined 
as before, r if the Alu sequence is to the right of the 



read and 1 if the Alu sequence is to the left. All / and r 
reads falling in that window are realigned to the Alu 
being considered. All reads where only one end maps 
inside the window and are not Alu mates are ignored. 

We then compute the probability of observing the 
insert lengths in T given three different genotype mod- 
els: Homozygote Alu, heterozygote and homozygote non 
Alu. We note that on chromosomes where there is an 
Alu sequence present then the reads mapping with one 
end inside of the Alu and one to the right of the Alu 
and the reads that map with one end to the left of the 
Alu and one inside of it will be independent of each 
other. On chromosomes where there is not an Alu 
sequence present the reads to the left and the right of 
the purported Alu location will be perfectly dependent. 
If our model is that reads are randomly sampled from 
the chromosome the reads can fulfill the criteria of 
belonging to T in one of three ways, each being equally 
likely; From the chromosome carrying the deletion, as a 
read pair mapping with one end inside the Alu and the 
other to the left of the Alu and as a read mapping with 
one end inside the Alu and the other to the right of the 
Alu. For the heterozygote case the probability that a 
read comes from the distribution Y is then |, while the 
probability of coming from Z = Y + l Atu is |, where l A i u 
is the length of the Alu. 

P(data\HomoNonAlu) = P(D|0) = ]~[ y M 

teT 

P{data\Hetero) = P(D\1) = ]~[ (-Z(t) + -Y{t)) 

teT ^ ^ 

P{data\HomoAlu) = P[D\2) = ]~[Z(t) 

teT 



Figure 2 Example of an Alu insertion. Example of an Alu insertion. Arrows show read direction, black arrows show reads mapping normally, 
red lines show the insert between them, green arrows show / reads and blue arrow show r reads. The leftmost figure shows a normal 
individual, center an heterozygote and rightmost an individual homozygote for an Alu insertion. Red dot at the bottom of each figure shows 
the location of the Alu insertion. 

I J 
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Deleted Alus in Multiple Individuals 

When considering multiple individuals simultaneously 
we can construct a likelihood ratio statistic for the 
occurance of the deletion. We let the individuals be 
labeled from 1 through m, D t be the set of sequence 
reads belonging to individual /. We let f 0 be the fre- 
quency of the homozygote Alu carriers in the popula- 
tion, fi be the frequency of the heterozygote and/ 2 be 
the frequency of the homozygote non-Alu. Then the 
joint likelihood of the data given an Alu deletion is: 

JIM CfoPpilO) +/iP(Di|l) +/ 2 P(Di|2)) 

We apply a likelihood ratio test to test whether a dele- 
tion is significantly more likely than the model on the 
statistic 

m m 

2^1og(foP(D f |0) +/iP(Di|l) +f 2 P(Pi\2)) - 2 ^logP(DilO) 

t=i t=i 

Under the null this statistic obeys a chi square distri- 
bution with two degrees of freedom [21]. 

If we assume Hardy- Weinberg equilibrium [22] we 
can estimate the frequency of the Alu deletion, p, on a 
haplotype level. Then the joint likelihood of the data 
given an Alu deletion is: 

iXi K 1 - rt 2p ( D ''l°) + 2p(l - p)P{Di\l) + p 2 P(A|2)) 

The corresponding likelihood ratio test will then obey 
a chi square distribution with one degree of freedom. 
We use the one degree of freedom test in the remainder 
of the paper. 

Inserted Alu identification 

One of the main complications in detecting Alu poly- 
morphisms is the fact that members of the Alu family 
are all highly similar. The Alu insertions which we are 
looking for will be similar to sequences already inserted 
and other sequences that also may have been inserted. 

The mapping of reads not mapping to Alu regions is 
generally more reliable, however a number of problems 
may occur; the region being considered may be dupli- 
cated, or the read may be chimeric, where due to arti- 
facts in the sequencing process two parts of the read 
come from different parts of the genome. This implies 
that not all / and r reads will be close to an actual Alu 
insertion. Some of the reads may also be close to Alus 
already discovered, but the mapping was not discovered 
by BWA, for a further discussion of these issues see 
Additional file 1. We start by finding regions that are 
likely to contain an insertion and then from that list we 
compute a probabilistic model verifying the insertion 
found, first for a single individual and then we extend 
this to multiple individuals. 



Identifying potential inserts 

As described earlier, we label Alu mates as either /, if 
their mapping to the reference genome implies that an 
Alu is to the left of them read or r if their mapping 
implies that an Alu is to the right of them. Each of the 
/ and r reads then gives partial information about the 
location of the Alu read. Given the location of an / 
read an Alu is implied in the region from l r + Y to l r + 
Y + L, where l r is the right endpoint of the / read 
being considered, Y is the distribution of the distances 
between paired-ends and L is the length of a read. 
Similarly, given the location of an r read an Alu is 
implied in the region from 77 - Y to 77 - Y - L, where 77 
is the left endpoint of the r read being considered. 
Some of the reads however may not be correctly 
mapped and should be considered errors. In particular, 
from the mapping of the reads to the reference gen- 
ome we know the number of best mappings of the 
reads in question, a read that has b best mappings will 
with probability \ be mapped correctly. This fact 
means that we can in a simple manor assign weights 
to sequence reads, with a read having b best mappings 
getting weight \. 

We say that an Alu position, a, covers an / read if l r + 
Y x _ e > a and l r + Y e < a + L, where Y e and Y 1 _ e are 
defined as before. Similarly an Alu position, a covers an 
r read if 77 - Y x _ e < a and 77 - Y e > a - L. For each / and 
r read we now want to either cover it with an Alu posi- 
tion or declare it as an error read, we define a constant 
k to be the relative cost between the two. 
Problem 6 

Alu genotyping problem Input A set L of / reads and a 
set R of r reads. 

Output A set A of Alu positions and E of errors. 

Objective min | E \ +k \ A \ 

Constraints Each / e L and r e R is either in E or 
covered by an a e A. 

We note that the most general version of this problem 
reduces to a set covering problem, which can be shown 
to be hard to even approximate [23]. However, as the 
reads are linearly arranged on the chromosome the sets, 
the problem reduces to set covering on interval graphs 
which can be solved in polynomial time using e.g. 
dynamic programming. 

For our empirical evaluations we set k - 3, represent- 
ing that at if three / or r reads are found that can be 
covered by a single Alu insertion we prefer to insert an 
Alu than to assign error labels to these reads. 
Optimal algorithm To search for regions likely to con- 
tain an Alu sequence we make a single pass through the 
genome. For each position, p, we sum the number of r 
reads within a window size Y\_ e to the left p and the 
number of / reads within a window size Y x _ e to the right 
of p. 
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The time complexity of the algorithm is 0(ncY 1 _ e ), 
where n is the length of the genome, c is the coverage. 
w chosen as the size of the largest Alu plus a maximum 
distance between paired-ends under the null distance. 
Regions where this indicator is above a given threshold 
are considered Alu regions. 
Covering multiple individuals 

One way to detect Alu insertions in multiple individuals 
is to pool the data into a single dataset and ignore the 
fact that there are multiple individuals being sequenced. 
This simple idea will however lack power to find infre- 
quent Alus. A region containing multiple / and r reads 
in a single individual is more likely to contain an Alu 
than one that has a single / or r read in multiple indivi- 
duals. We therefore do not want to determine an Alu 
unless there exist some individuals that have multiple / 
or r reads. We let ki and k 2 be constants, representing 
the cost of introducing an Alu insertion to the popula- 
tion and the cost of introducing an Alu insertion to 
each individual. We let A represent the set of Alus and 
for each Alu, y, we let Aj be the set of individuals con- 
taining the Alu. 
Problem 7 

Input A set / of individuals. A set of L t of / reads and a 
set Ri of r reads, for each individual i e /. 

Output A set A of Alu positions and E of errors. 

Objective min | E | +ki | A | +/c 2 | Aj | 

Constraints Each / e L t and r e R t is either in E or 
covered by an a e A and i e A a . 

We have not been able to determine the computa- 
tional complexity of this problem and leave open 
whether or not the problem is NP-hard. 
Heuristic algorithm When tuning these parameters we 
set ki = k 2 = 2, representing that we require two 
sequence reads in each individual to warrant introdu- 
cing a Alu insert in the population and two sequence 
reads to warrant introducing the Alu to the individual. 

We solve this problem using a heuristic. To prune the 
number of regions that we need to consider we start by 
considering each individual at a time. In each individual 
we search for regions where there are at least a small 
number of / and r reads within the same window of size 
2Yi_ e . We then merge the insert locations of two indivi- 
duals if they appear to be very close to each other. 
Genotyping of inserted Alus 

Given the location of potential Alu insertions we run an 
algorithm similar to the one that we ran for Alus that 
are deleted with respect to the reference. 
Until convergence 

Estimate length of Alu insertion 
Re-estimate positions 



Insert the Alu insertion in silico in the position 
determined. 

Apply the algorithm for deleted from reference for 
genotype calling. 

Alu insertion length estimation We assume that there 
is a single insertion event that occurred in all of the 
individuals simultaneously. For each read pair, t, we 
have given a position on the chromosome of the non- 
Alu read, c t1 a position within the Alu of the Alu read 
a t , mean distance between the two, m t and standard 
deviation in distance between the two, s t . The means 
and the standard deviation are estimated from the reads 
of each individual independently. 

Assume we know a position p Atu where there is an 
insertion. Now consider all Alu read pairs in the interval 
\Paiu ' Y i-e> Paiu + Y\. € ]. Now assume that we have 
aligned all Alu read pairs in this interval to the same 
Alu, of length l Atw Our model of the true length of the 
Alu is that it is l Atu + A + p, where A and p are con- 
stants, which can be either positive or negative. A repre- 
sents a left offset in the length of the Alu and p 
represents a right offset in the length of the Alu. 

We now estimate A and p seperately. We start by con- 
sidering all reads pairs with the non-Alu read in [p A i u ' 
PaiiA an d use these to estimate A. Let d t - p A i u - c t , 
then the estimate of A from t is X t - m t - d t - a t) with 
standard deviation s t . When considering multiple reads 
the maximum likelihood estimate of A is then: 



X = 



Similarly we get an estimate for p by setting 
&t = IaIu — a u We now consider all Alu read pairs with a 
non Alu read in the interval \p A i w p A i u + Yi_J and use 
these to get an estimate of p. Let d t - c t - p A i w then the 
estimate of p from t is p t = m t — d t — a~[ with standard 
deviation s t . When considering multiple reads the maxi- 
mum likelihood estimate of p is then: 



P = 



^ s 2 



Alu insert position reestimation 

Each read gives an estimate of the location of the 
inserted Alu. A joint estimate is determined from all of 
the reads in a given region. This is done in the same 
manor as described above, where we isolate p Alu from 
the equations instead of A and p. 
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In silico insertion and deleted algorithm Once the 
location of the Alu insertion and the length of the Alu 
is determined a new sequence is constructed containing 
the Alu at the inserted location. Following the construc- 
tion of this new sequence a graph, identical to the one 
described for Alus deleted with respect to the reference, 
containing the location of the reads in the interval is 
constructed as before. 

The in silico constructed genomic sequence now con- 
tains the Alu that we previously considered to be 
inserted. The Alu sequence is therefore deleted with 
respect to this sequence and we can apply the same 
algorithm as before. 

Results 

We run our experiments on simulated data and on data 
from the 1000 genomes project. 

Simulated data 

We benchmark our algorithms on simulated data. We 
downloaded chromosome 22 of build 37 of the human 
genome, as well as the RepeatMasker track to identify 
Alu sequences in the build. We downloaded a database 
of Alu sequences from RepBase [24]. We selected four 
Alu sequences known to be active in humans; AluYa5, 
AluYb8,AluYb9,AluYkl3; and Alujo, a sequence not 
known to be active. At each location the Alu sequences 
were mutated independently with a 3% uniform muta- 
tion frequency. Each of the five Alus was inserted at ten 
different locations, for a total of 50 Alus inserted. We 
inserted the Alus into 100 different chromosomes. At 
each location we used one of ten different frequencies 
of insertion; 2, 4, 5, 10, 20, 80, 90, 94, 96, 98%. As each 
Alu was inserted into a different number of chromo- 
somes depending on their frequency, each chromosome 
contained on average 25 Alu insertions, ranging from 21 
to 33 Alus inserted into each chromosome. 

The 100 chromosomes where then paired to construct 
50 diploid individuals, with each individual containing 
on average 50 Alu insertions. The Alu insert locations 
were chosen randomly on the chromosome, with the 
constraint that no Alu was added within Y x _ e basepairs 
of another Alu and no more than 1% of basepairs are 
annotated N in a 2Y 1 _ e basepair window surrounding 
the introduced Alu. This allows us to focus our results 
only on Alu insertions that are distant from other Alus 
and is not meant to representative of the process in 
which Alu's are inserted. Reads were simulated using 
the program SimSeq [25]. Reads were simulated inde- 
pendently for each chromosome, with an average of 5x 
coverage per chromosome or lOx coverage per indivi- 
dual. In our experiments 97% of all reads not mapping 
to Alu regions mapped uniquely to the genome, using 



BWA. We simulated our data with both with no error 
and with 2% error. 
Alu insertion 

The set of individuals were selected to have similar cov- 
erage and being genotyped under similar conditions. We 
benchmark our Alu insertion identification algorithm by 
considering the mapping of the reads of the simulated 
individuals to the reference genome, results are shown 
in Table 1. 

We ran our insertion algorithm on each individual 
independently. When tuning our algorithms to find no 
false positives we find 96.4% of all Alus inserted. The 
false negatives are mostly from individuals that are het- 
erozygote for the insertion and are mostly when there is 
other surrounding variation. 
Alu deletion 

We benchmark our Alu deletion identification algorithm 
by considering the mapping of the reads of the simu- 
lated individuals to a simulated individual that contains 
all the Alus, results are shown in Table 2. When tuning 
our algorithms to find no false positives we find 97.7% 
of all Alus deleted. We find deleted Alus in 1390 of the 
1422 locations known to contain an Alu. 

In Additional file 1 we investigate the effects of higher 
error rate on our algorithm. 

Verification on triad data 

We investigated whether the the deletions that we 
detected were transmitted to the children. We simulated 
fifty trios where we independently simulated two chro- 
mosomes with randomly inserted Alus for each parent. 
We then randomly selected one chromosome from each 
parent to use for the child. We found very high concor- 
dance between parent and the child, as shown in Table 3. 

1000 genomes 

We run our experiments on twenty individuals from 
LWK: Luhya in Webuye, Kenya population of the 1000 
genomes project [6,26,27]. 

We find an average of 1418 Alus that are deleted with 
respect to the reference. This corresponds to a rate of 
approximately Alus m the human genome being 
deleted with respect to the reference, a rate comparable 
to the SNP polymorphism rate. A table showing the 
number of Alus deleted with respect to the reference in 



Table 1 Alus inserted with respect to the reference 





Expected 


Found(%) 


Error free 


1512 


1483 (98.1%) 


2% error 


1512 


1446(95.6%) 



Number of Alus found inserted with respect to the reference in simulated 
genotype data. 
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Table 2 Alus deleted with respect to the reference 





Expected 


Found(%) 


Error free 


1422 


1390(97.7%) 


2% error 


1422 


1385(97.4%) 



Number of Alus found deleted with respect to the reference in simulated 
genotype data. 



each individual in the LWK population is shown in 
Additional file 1. 

We find an average of 5990 Alus that are inserted 
with respect to the reference. A table showing the num- 
ber of inserted Alus in each individual in the is shown 
in Additional file 1. 

dbRIP [28] is database containing 2083 Alus known to 
be polymorphic in the human population. On average 
each one of our individuals contains 280 of the Alus 
represented in dbRIP. 

Stewart et al. [29] found a total of 1730 Alus that were 
deleted with respect to the reference and 4499 Alus that 
were inserted with respect to the reference when, con- 
sidering a subset of the 1000 genomes population. The 
individuals considered by Stewart et al. were not the 
same as the ones considered by us. We note that this 
number is lower than we are finding, we have not inves- 
tigated the source of this difference and it may be due 
to the fact that our method is more sensitive or gives 
more false positives. When comparing a single indivi- 
dual to the set of deletions found by Stewart et al we 
find that on average 73.4% of the deleted that we find 
were found in some of the individuals studied by Stew- 
art et al. We find that 7.2% of the inserted Alus that we 
find are found in some of the individuals studied by 
Stewart et al. The high concordance for the deleted case 
is promising. The comparatively lower concordance with 
the inserted Alus may be due to the fact that our algo- 
rithm has a high false positive rate, but also may be due 
to the fact that Alu insertions are of low frequency and 
the population that we study is distantly related from 
the population studied by Stewart et al. 

When we compare the deleted Alus of two individuals 
we found that 61.5% of the deletions found in one indi- 
vidual are also found in another individual. For inserted 
Alus this number is 15.6%. The reason for this differ- 
ence is the fact that Alus generally have a low 



Table 3 Trio results 





Found in child 


Matches parents 


Homozygote deleted 


997 


997 (100%) 


Heterozygote 


368 


362 (98.4%) 



The number of deletions found in child that were also found in a consistent 
manor in its parents. The first line shows when the child is homozygote for 
the deletion. The second line shows the results when the child carries only a 
single copy of the deletion. 



Table 4 Estimated Alu families 



Total 


AluY 


22660(82,15%) 


AluS 


3167(11,48%) 


AluJ 


1 758(6,38%) 



Estimated Alu families of Alus deleted with respect to the reference genome 
using 1000 Genomes data. 



frequency, the deleted Alus are generally the ones that 
have been inserted into the reference genome and hence 
they will not be present in a large number of the other 
individuals, while the inserted have only been inserted 
into a subset of the population. 
Timing 

We ran our computations on desktop machine using a 
single 3.06 GHz Intel i5 processor. On average each 
individual of the 1000 genomes data took lhr and 44 
minutes to analyze regions that are deleted with respect 
to the reference and 2hrs and 1 minute to analyze 
regions that are inserted with respect to the reference. 

Alu families 

We investigate which Alu families are deleted. We esti- 
mate the Alu family from the repeat masker annotations 
(cf. Table 4). Using these annotations 82.15% of the 
deletions are found to belong to the AluY family. This 
family is believed to be the family most polymorphic in 
humans [15]. We also find that 11.48% of the deletions 
that we find belong to the AluS family and 6.38% belong 
to the AluJ family. 

Conclusions 

A number of improvements can be made to the the algo- 
rithm that we have presented. Broken reads, those where 
one part maps to the reference genome and one part maps 
to an insertion or where one part maps to one side of an 
deletion and one part to the other, can be used to improve 
the algorithms described here. In our algorithm we study 
only the single best mapping of each sequence read. An 
alternative would be to study multiple mapping of reads to 
the reference genome. We will attempt to explore such 
solutions, however our experimental results suggests that 
this will provide little gain for most regions of the genome 
with considerable added algorithmic complexity. Our 
future goals are to extend the methods developed here to 
find other types of structural variations. 

Additional material 



Additional file 1: Supplementary material contains a more detailed 
description of our methods, additional simulation results and 
results on the 1000 genomes data. 
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